STATS 32 Session 8: Reproducible research
Kenneth Tay
Oct 17, 2019
Recap of session 7
- Importing data with readr
- Where does your data live?
- Factors
File paths and working directories
- A character string that tells you the location of a file
- Absolute path: starts from the “root” directory
- e.g. /Users/kjytay/Downloads/datafile.csv
- Relative path: starts from the current directory (denoted by .)
- e.g. If I am in the folder /Users/kjytay: ./Downloads/datafile.csv
- e.g. If I am in the folder /Users/kjytay/Downloads: ./datafile.csv or simply datafile.csv
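The same file can be reached through an absolute path or through different relative paths, depending on the current folder. A minimal sketch, assuming a hypothetical folder layout created inside R's temporary directory (it stands in for /Users/kjytay/Downloads; the file contents are made up):

```r
# Build a hypothetical kjytay/Downloads layout inside the temporary directory.
home <- file.path(tempdir(), "kjytay")
downloads <- file.path(home, "Downloads")
dir.create(downloads, recursive = TRUE, showWarnings = FALSE)

# Absolute path: works no matter which folder we are currently in.
abs_path <- file.path(downloads, "datafile.csv")
writeLines(c("x,y", "1,2"), abs_path)

# Relative paths: their meaning depends on the current folder.
old_wd <- setwd(home)                       # now "in" the kjytay folder
from_home <- file.exists("./Downloads/datafile.csv")

setwd(downloads)                            # now "in" the Downloads folder
from_downloads <- file.exists("./datafile.csv") && file.exists("datafile.csv")

setwd(old_wd)                               # restore the original folder

c(from_home = from_home, from_downloads = from_downloads)
```

Both relative forms resolve to the same file as abs_path; only the starting folder differs.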
File paths and working directories
- Working directory: where R looks for files that you ask it to load
- Also where R will put any files that you ask it to save
- You can see your current working directory at the top of the console or by typing getwd()
- You can change the working directory with the setwd() function or via Session > Set Working Directory > …
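As a rough illustration of the points above, the sketch below switches the working directory to a scratch folder, saves a file using only its name, and confirms that the file landed in that folder. The folder and file names are hypothetical.

```r
old_wd <- getwd()                    # remember where we started

# Hypothetical scratch folder inside the temporary directory.
scratch <- tempfile("wd_demo_")
dir.create(scratch)
setwd(scratch)                       # R now loads from and saves to this folder

# A bare file name is interpreted relative to the working directory.
write.csv(data.frame(x = 1:3), "saved.csv", row.names = FALSE)
saved_here <- file.exists(file.path(scratch, "saved.csv"))

loaded <- read.csv("saved.csv")      # read back using the same bare name

setwd(old_wd)                        # restore the original working directory
saved_here
```

Restoring the original working directory at the end keeps the demonstration from affecting later code.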
Factors
- A concept unique to R
- Useful for working with categorical variables: variables that have a fixed and known set of possible values
- Why use factor variables instead of character variables?
- Character variables don’t protect you from typos
- Character variables don’t sort in a useful way
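Both drawbacks of character variables can be seen with base R's factor(). A small sketch, using made-up size labels with one deliberate typo:

```r
# Made-up data; the last value is a deliberate typo.
sizes_chr <- c("small", "large", "medium", "smal")

# Characters sort alphabetically, and the typo goes unnoticed.
sort(sizes_chr)

# Declaring the allowed levels turns the typo into NA
# and makes sorting follow the level order.
size_levels <- c("small", "medium", "large")
sizes_fct <- factor(sizes_chr, levels = size_levels)

sizes_fct          # the fourth value is NA
sort(sizes_fct)    # small, medium, large (sort() drops NA by default)
```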
Functions for factors
fct_recode(): change factor levels
fct_collapse() and fct_lump(): reduce the number of factor levels
fct_infreq(): sort factor levels by how often they appear
fct_reorder(): sort factor levels by some other variable
fct_rev(): reverse the order of the factor levels
All these functions are part of the forcats package, which is automatically loaded when you load the tidyverse package.
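The functions above can be tried on a tiny made-up factor. This is only a sketch: the values, the new level names and the helper vector score are all hypothetical, and it assumes the forcats package is installed.

```r
library(forcats)

# Made-up factor: "a" appears once, "b" twice, "c" three times.
f <- factor(c("a", "b", "b", "c", "c", "c"))

recoded   <- fct_recode(f, apple = "a")            # rename level "a" to "apple"
collapsed <- fct_collapse(f, ab = c("a", "b"))     # merge "a" and "b" into "ab"
lumped    <- fct_lump(f, n = 1)                    # keep the most common level, lump the rest
by_freq   <- fct_infreq(f)                         # most frequent level first
reversed  <- fct_rev(f)                            # reverse the level order

# Hypothetical second variable: levels are ordered by its median within each level.
score <- c(3, 2, 2, 1, 1, 1)
by_score <- fct_reorder(f, score)

levels(recoded)
levels(collapsed)
levels(lumped)
levels(by_freq)
levels(reversed)
levels(by_score)
```

Comparing each levels() result with levels(f) shows what each function changed.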
Reproducible research: what & why
Reproducible research: publishing data analyses together with their data and code so that others may “reproduce” the findings.
Why reproducible research?
- Increase transparency and robustness of analyses
- Preserve integrity of analyses over time
- Reduce incentive for dishonest practices
R scripts
- An R script is a file containing lines of R code that are meant to be run all together
- R scripts are typically working files, not intended for presentation
- R scripts have the .R file extension
- Comments can be inserted to explain the code
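For illustration, the contents of a very small R script might look like the sketch below. The file name analysis.R and the prices are made up; lines starting with # are comments.

```r
# analysis.R (hypothetical file name)
# Goal: summarise a small, made-up vector of nightly prices.

prices <- c(80, 120, 95, 150)    # made-up prices in dollars

# Compute the average price.
mean_price <- mean(prices)

# Print the result to the console.
print(mean_price)
```

Saved under that name, the whole script could be run in one go with source("analysis.R").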
R markdown
RStudio: R markdown is a document format which allows you to “weave together narrative text and code to produce elegantly formatted output.”
Made possible by the knitr package (Yihui Xie)
R markdown: output (1)
R markdown: output (2)
R markdown: output (3)
R markdown: more details
- Text (written in Markdown), interspersed with code chunks, “knit” into a document using the knitr package
- Typically used for presentation
- R markdown files have .Rmd extensions
- R markdown cheatsheet and reference guide available here
Surprise: (Almost) all the class material (including slides) was created with R markdown!
Quick intro to Markdown
Markdown is a simple way to convert a text document into a web file (i.e. HTML) with basic styling.
Has support for:
- Headers
- Emphasis (italics, bold, strikethrough)
- Lists
- Links
- Images
- Etc…
Markdown reference here.
To see what your Markdown (.md) document looks like in real time, use an online Markdown editor (e.g. dillinger.io)
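To make the list of supported features concrete, the sketch below uses R to write a small Markdown file containing one example of each. The link URL and the image file name are placeholders, and the ~~strikethrough~~ syntax is an extension that many, but not all, Markdown renderers accept.

```r
# Each string is one line of a Markdown document.
md_lines <- c(
  "# A top-level header",
  "## A second-level header",
  "",
  "*italics*, **bold**, ~~strikethrough~~",
  "",
  "- first list item",
  "- second list item",
  "",
  "[link text](https://example.com)",          # placeholder URL
  "",
  "![alt text for an image](figure.png)"       # placeholder image file
)

# Write the lines to a temporary .md file and show its contents.
md_file <- tempfile(fileext = ".md")
writeLines(md_lines, md_file)
cat(readLines(md_file), sep = "\n")
```

Pasting the printed text into an online Markdown editor shows how each line is rendered.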
Today’s dataset: Airbnb listings
Rmd workflow (basic)
- Edit the .Rmd file in RStudio.
- Knit the document (either by hitting the “Knit” button or using a keyboard shortcut).
- When you press “Knit”, the file is automatically saved.
- Next, RStudio opens a new console, “knits” the document there, then closes that console. No code is run in your original console!
- RStudio creates a .html file in the same folder as the .Rmd file.
- Preview output in the preview pane, or by opening the .html file.
- If you want to make changes, go back to Step 1.
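The workflow above can be approximated from code with rmarkdown::render(). This is a sketch under assumptions rather than a description of exactly what the “Knit” button does: the file name example.Rmd and its contents are made up, and the rendering step only runs if the rmarkdown package and pandoc are available.

```r
# Write a minimal, hypothetical R Markdown file into the temporary directory.
rmd_file <- file.path(tempdir(), "example.Rmd")
writeLines(c(
  "---",
  "title: \"Example\"",
  "output: html_document",
  "---",
  "",
  "The mean of the numbers 1 to 10 is computed below.",
  "",
  "```{r}",
  "mean(1:10)",
  "```"
), rmd_file)

# Render only if the required tools are installed (an assumption of this sketch).
if (requireNamespace("rmarkdown", quietly = TRUE) && rmarkdown::pandoc_available()) {
  # envir = new.env() keeps the document's variables out of our own workspace.
  html_file <- rmarkdown::render(rmd_file, quiet = TRUE, envir = new.env())
  print(html_file)   # path of the .html file, next to the .Rmd file
}
```

Passing a fresh environment is a loose stand-in for the separate console that RStudio uses when knitting.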
Common Rmd chunk options
include = FALSE: prevents code and results from appearing in the finished file. R Markdown still runs the code in the chunk, and the results can be used by other chunks.
- Useful for decluttering your Rmd output, showing only essential code.
echo = FALSE: prevents code, but not the results, from appearing in the finished file.
- Useful if you just want to show figures but not the code that generated them.
eval = FALSE: Code appears in the output but is not run.
- Useful for presenting code for demonstration purposes.
message = FALSE: prevents messages that are generated by code from appearing in the finished file.
- Useful for suppressing messages when loading packages.
warning = FALSE: prevents warnings that are generated by code from appearing in the finished file.
- Useful for suppressing warnings when loading packages, plotting data or fitting models.
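The effect of include, echo and eval can be checked with a small experiment. The sketch below builds a hypothetical three-chunk R Markdown document, runs it through knitr::knit() (which produces the intermediate Markdown only, not the final HTML), and prints the result. The chunk labels and contents are made up, and the knitr package is assumed to be installed.

```r
library(knitr)

# A tiny R Markdown document, one string per line (hypothetical content).
rmd_lines <- c(
  "Some narrative text.",
  "",
  "```{r hidden-setup, include = FALSE}",
  "x <- 21",
  "```",
  "",
  "```{r results-only, echo = FALSE}",
  "x * 2",
  "```",
  "",
  "```{r code-only, eval = FALSE}",
  "stop('this line is shown but never run')",
  "```"
)

rmd_file <- tempfile(fileext = ".Rmd")
md_file <- tempfile(fileext = ".md")
writeLines(rmd_lines, rmd_file)

# Run the chunks and write the resulting Markdown to md_file.
knit(rmd_file, output = md_file, quiet = TRUE)

md <- readLines(md_file)
cat(md, sep = "\n")
```

In the printed output, the first chunk leaves no trace even though x is available to the second chunk, the second chunk shows only its result, and the third shows only its code.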